Exploratory Analysis of Seattle Collisions by Leah Erb

This report explores collision reports in Seattle, Washington, USA.

Dataset

The Seattle Collisions dataset is a compilation of over 200,000 collision reports created by Seattle Police Department (SPD) that were then recorded by Seattle Department of Transportation (SDOT), between the years 2004 and 2018.

Data Structure

Rows:  205380 
Variables:  28
'data.frame':   205380 obs. of  28 variables:
 $ ADDRTYPE       : Factor w/ 4 levels "","Alley","Block",..: 3 3 3 4 4 3 3 3 4 4 ...
 $ SEVERITYCODE   : Factor w/ 6 levels "","0","1","2",..: 3 3 3 4 4 3 4 3 4 4 ...
 $ SEVERITYDESC   : Factor w/ 5 levels "Fatality","Injury",..: 3 3 3 2 2 3 2 3 2 2 ...
 $ COLLISIONTYPE  : Factor w/ 11 levels "","Angles","Cycles",..: 5 11 6 2 3 7 9 11 5 9 ...
 $ PERSONCOUNT    : int  2 3 2 4 5 3 2 2 2 2 ...
 $ PEDCOUNT       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PEDCYLCOUNT    : int  0 0 0 0 1 0 0 0 0 0 ...
 $ VEHCOUNT       : int  2 2 2 2 1 2 2 2 2 2 ...
 $ INJURIES       : int  0 0 0 1 1 0 2 0 2 1 ...
 $ SERIOUSINJURIES: int  0 0 0 0 0 0 0 0 0 0 ...
 $ FATALITIES     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ INCDATE        : Date, format: "2013-04-02" "2013-03-30" ...
 $ INCDTTM        : Factor w/ 155661 levels "1/1/04","1/1/05",..: 81567 74101 20779 125708 81577 57671 153722 114998 81578 138038 ...
 $ JUNCTIONTYPE   : Factor w/ 8 levels "","At Intersection (but not related to intersection)",..: 6 6 6 3 3 6 5 5 3 3 ...
 $ SDOT_COLCODE   : int  11 11 11 11 51 11 14 11 11 14 ...
 $ INATTENTION    : Factor w/ 2 levels "","Y": 1 1 1 1 2 1 2 2 1 2 ...
 $ DUI            : Factor w/ 3 levels "","N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ WEATHER        : Factor w/ 10 levels "Blowing Sand or Dirt or Snow",..: 5 2 6 2 5 2 2 6 2 2 ...
 $ ROADCOND       : Factor w/ 10 levels "","Dry","Ice",..: 2 2 10 2 2 2 2 10 2 2 ...
 $ LIGHTCOND      : Factor w/ 9 levels "","Dark - No Street Lights",..: 6 6 6 6 7 4 6 6 7 6 ...
 $ SPEEDING       : Factor w/ 2 levels "","Y": 1 1 2 1 1 1 1 1 1 1 ...
 $ Year           : Factor w/ 15 levels "2004","2005",..: 10 10 3 2 10 3 1 2 10 2 ...
 $ Month          : Factor w/ 12 levels "1","2","3","4",..: 4 3 10 7 4 2 9 6 4 8 ...
 $ Time           : POSIXct, format: "2019-02-26 15:10:00" "2019-02-26 14:00:00" ...
 $ Hour           : int  15 14 10 11 19 0 12 NA 19 7 ...
 $ DayPart        : Factor w/ 4 levels "Afternoon","Evening",..: 1 1 3 3 2 4 1 NA 2 3 ...
 $ SDOTtype       : Factor w/ 7 levels "Head On","Hits Pedestrian",..: 6 6 6 6 7 6 5 6 6 5 ...
 $ Season         : Factor w/ 4 levels "Fall","Spring",..: 2 2 1 3 2 4 1 3 2 3 ...
         ADDRTYPE      SEVERITYCODE               SEVERITYDESC   
             :  3608     :     0    Fatality            :   315  
 Alley       :   809   0 : 19559    Injury              : 54377  
 Block       :134732   1 :128276    Property Damage Only:128276  
 Intersection: 66231   2 : 54377    Serious Injury      :  2853  
                       2b:  2853    Unknown             : 19559  
                       3 :   315                                 
                                                                 
    COLLISIONTYPE    PERSONCOUNT       PEDCOUNT        PEDCYLCOUNT     
 Parked Car:45855   Min.   : 0.00   Min.   :0.00000   Min.   :0.00000  
 Angles    :32752   1st Qu.: 2.00   1st Qu.:0.00000   1st Qu.:0.00000  
 Rear Ended:32404   Median : 2.00   Median :0.00000   Median :0.00000  
           :23778   Mean   : 2.23   Mean   :0.03707   Mean   :0.02691  
 Other     :22952   3rd Qu.: 3.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Sideswipe :17385   Max.   :93.00   Max.   :6.00000   Max.   :2.00000  
 (Other)   :30254                                                      
    VEHCOUNT         INJURIES       SERIOUSINJURIES      FATALITIES      
 Min.   : 0.000   Min.   : 0.0000   Min.   : 0.00000   Min.   :0.000000  
 1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.:0.000000  
 Median : 2.000   Median : 0.0000   Median : 0.00000   Median :0.000000  
 Mean   : 1.738   Mean   : 0.3741   Mean   : 0.01511   Mean   :0.001641  
 3rd Qu.: 2.000   3rd Qu.: 1.0000   3rd Qu.: 0.00000   3rd Qu.:0.000000  
 Max.   :15.000   Max.   :78.0000   Max.   :41.00000   Max.   :5.000000  
                                                                         
    INCDATE              INCDTTM      
 Min.   :2004-01-01   11/2/06:   103  
 1st Qu.:2007-04-12   10/8/04:    98  
 Median :2011-01-29   10/3/08:    92  
 Mean   :2011-03-16   11/5/05:    85  
 3rd Qu.:2015-02-03   1/2/04 :    80  
 Max.   :2018-12-31   8/6/04 :    79  
                      (Other):204843  
                                            JUNCTIONTYPE    SDOT_COLCODE  
 Mid-Block (not related to intersection)          :93480   Min.   : 0.00  
 At Intersection (intersection related)           :63603   1st Qu.:11.00  
 Mid-Block (but intersection related)             :23747   Median :11.00  
 Driveway Junction                                :10970   Mean   :13.38  
                                                  :10954   3rd Qu.:14.00  
 At Intersection (but not related to intersection): 2419   Max.   :69.00  
 (Other)                                          :  207                  
 INATTENTION DUI                          WEATHER      
  :177419     : 23758   Clear or Partly Cloudy:106169  
 Y: 27961    N:172648   Unknown               : 38763  
             Y:  8974   Raining               : 31724  
                        Overcast              : 26396  
                        Snowing               :   818  
                        Other                 :   782  
                        (Other)               :   728  
       ROADCOND                        LIGHTCOND      SPEEDING  
 Dry       :119152   Daylight               :110852    :196105  
 Wet       : 45175   Dark - Street Lights On: 46431   Y:  9275  
           : 23875                          : 24021             
 Unknown   : 14750   Unknown                : 13203             
 Ice       :  1140   Dusk                   :  5647             
 Snow/Slush:   923   Dawn                   :  2385             
 (Other)   :   365   (Other)                :  2841             
      Year            Month            Time                    
 2005   : 16016   10     :18967   Min.   :2019-02-26 00:01:00  
 2006   : 15794   11     :17645   1st Qu.:2019-02-26 09:53:00  
 2004   : 15457   5      :17614   Median :2019-02-26 14:23:00  
 2007   : 15082   8      :17550   Mean   :2019-02-26 13:41:57  
 2015   : 14260   6      :17498   3rd Qu.:2019-02-26 17:48:00  
 2008   : 14139   7      :17448   Max.   :2019-02-26 23:59:00  
 (Other):114632   (Other):98658   NA's   :49797                
      Hour            DayPart                 SDOTtype        Season     
 Min.   : 0.00   Afternoon:52009   Head On        : 7062   Fall  :53586  
 1st Qu.: 9.00   Evening  :28499   Hits Pedestrian:17846   Spring:50957  
 Median :14.00   Morning  :39398   Non-Collision  :  393   Summer:52496  
 Mean   :13.25   Night    :35677   Other Collision:30343   Winter:48341  
 3rd Qu.:17.00   NA's     :49797   Rear End       :61677                 
 Max.   :23.00                     Sideswipe      :86569                 
 NA's   :49797                     Struck Object  : 1490                 

The dataset consists of 28 variables for 205,380 observations.

It will be interesting to see what factors may contribute to Seattle collisions, and if they change over time.


Univariate Plots Section

Has the total number of collisions changed over the years?

The distribution of collision frequency by year appears bimodal, peaking at 2005 and, to a lesser extent, 2015, with a low at 2010 and again in 2018. This plot hints at a possible 5-year pattern.

I had expected to see a steady increase in Seattle collisions, mirroring population growth. Perhaps I’m wrong about the steady increase in population, or about the number of reported collisions being strongly correlated with population in the first place.


Do collision frequencies change throughout the day?

It appears that collision counts tend to rise during commute hours, starting at the lowest point around 4am.


What are the COLLISION types, according to the SPD?

Our dataset includes 2 Collision type (category) variables:

  1. COLLISIONTYPE reported by SPD (Seattle Police Department)

  2. SDOTtype recorded by SDOT (Seattle Department of Transportation)


What are the collision types as recorded by SDOT?


SPD has 11 collision categories while SDOT has 7. Which one I choose for this study could affect the analysis. For example, for collisions involving pedestrians, SPD reports only 1,770 ‘Pedestrian’ collisions, while SDOT records ‘Hits Pedestrian’ 17,846 times. That’s a nearly 10x difference. Also, SDOT’s ‘Rear End’ count (61,677) is almost twice SPD’s ‘Rear Ended’ count (32,404).

It is possible that SPD has reason to categorize collisions differently than SDOT. The reasons are outside the scope of this project. Nonetheless, it would be interesting to look into how Collision types are recorded, and whether it would make a difference in our analysis.

In this report, except where otherwise noted, I will be using SDOT’s Collision categories.


What are the SEVERITY categories?

Observations:

  • This plot orders the Severity bins in order of worsening severity.

  • The frequency of collisions by Severity is right skewed (towards higher severity).

  • Over 60% of collisions result in Property Damage Only (no injuries or deaths).

The Severity factor alone does not quantify the damage done by collisions. We could make a rough attempt to quantify damage by using columns VEHCOUNT, INJURIES, SERIOUSINJURIES and FATALITIES in a calculation, but it would be a crude quantification.
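To give a feel for what such a crude calculation might look like, here is a minimal sketch that combines the four count columns into one weighted score. The weights and the column name `damage_score` are hypothetical choices made up for this example; they do not come from the dataset or from SDOT.

```r
# Toy rows standing in for the collisions data frame (values are made up).
collisions <- data.frame(
  VEHCOUNT        = c(2, 1, 3),
  INJURIES        = c(0, 1, 2),
  SERIOUSINJURIES = c(0, 0, 1),
  FATALITIES      = c(0, 0, 1)
)

# Hypothetical weights: heavier outcomes count for more. Pure assumption.
weights <- c(VEHCOUNT = 1, INJURIES = 5, SERIOUSINJURIES = 20, FATALITIES = 100)

# Weighted sum across the four columns, giving one score per collision.
score_matrix <- as.matrix(collisions[names(weights)]) %*% weights
collisions$damage_score <- as.vector(score_matrix)

print(collisions)
```

Any ranking produced this way depends entirely on the chosen weights, which is why I call it crude.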


Geographical, environmental and time factors

Let’s get a rough idea of how counts are distributed amongst these types of categorical factors. Don’t worry about not being able to read the x-axis labels yet :-> .

Observations:

  • Location variables (AddrType and JunctionType) each have two predominant types. We might dig into these two variables later, to see if JunctionType is a subset of AddrType.

  • Environmental conditions (Weather, LightCond, RoadCond) each have one dominant category. Earlier we saw that most collisions happen in the middle of the day, so I suspect that is represented here (Clear, Daylight, Dry).

  • Time factor (Season, DayPart) frequencies are more evenly spread, with the highest frequencies in Fall and in the Afternoon.


The Counts of Things

These are the quantitative variables, which are simply counts of things. Let’s see what they are and how they’re spread out.


Since many of the Counts of Things have 0 value, plot them again but this time with log10 so non-zero values are more visible.
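As a small illustration of why the log10 view helps, the sketch below takes a made-up count variable dominated by zeros and puts its bin frequencies on a log10 scale. The data and variable names are invented for the example.

```r
# Toy count variable: mostly zeros, a few larger values (made-up data).
injuries <- c(rep(0, 1000), rep(1, 100), rep(2, 10), 5)

# Frequency of each distinct value.
freq <- table(injuries)

# log10 of the frequencies compresses the dominant zero bin, so the
# rare non-zero bins stay visible on the same axis.
log_freq <- log10(as.vector(freq))
names(log_freq) <- names(freq)

print(freq)
print(log_freq)
```

On the raw scale the tallest bin is 1,000 times the shortest; on the log10 scale it is only 3 units taller. In ggplot2, adding `scale_y_log10()` to a histogram applies the same kind of axis transformation.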

    VEHCOUNT       PERSONCOUNT       PEDCOUNT        PEDCYLCOUNT     
 Min.   : 0.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.00000  
 1st Qu.: 2.000   1st Qu.: 2.00   1st Qu.:0.00000   1st Qu.:0.00000  
 Median : 2.000   Median : 2.00   Median :0.00000   Median :0.00000  
 Mean   : 1.738   Mean   : 2.23   Mean   :0.03707   Mean   :0.02691  
 3rd Qu.: 2.000   3rd Qu.: 3.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :15.000   Max.   :93.00   Max.   :6.00000   Max.   :2.00000  
    INJURIES       SERIOUSINJURIES      FATALITIES      
 Min.   : 0.0000   Min.   : 0.00000   Min.   :0.000000  
 1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.:0.000000  
 Median : 0.0000   Median : 0.00000   Median :0.000000  
 Mean   : 0.3741   Mean   : 0.01511   Mean   :0.001641  
 3rd Qu.: 1.0000   3rd Qu.: 0.00000   3rd Qu.:0.000000  
 Max.   :78.0000   Max.   :41.00000   Max.   :5.000000  


The Counts of Things frequencies are all right-skewed. Only the VEHCOUNT and PERSONCOUNT peak at 2 instead of 0.


Let’s look closer at Vehicle and Person counts …

VEHCOUNT

“The number of vehicles involved in the collision. This is entered by the state.”

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.000   2.000   1.738   2.000  15.000 

The vast majority of collisions involve 2 vehicles. Q1, the median and Q3 are all exactly 2, and the mean (1.74) is close to it, as reflected in the flat box plot.


PERSONCOUNT

“The total number of people involved in the collision”

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    2.00    2.23    3.00   93.00 


Observations:

  • Most collisions involve 2 people, the next highest count being 3 persons.

  • Some reportedly have zero PERSONCOUNT.

  • Median and Q1 are the same (2), with Q3 being just one away, then a very long tail to a maximum of 93.

  • There is no data available to explain the circumstances of the larger PERSONCOUNTs.


How about driver diminished capacity and error (DUI, Inattention, Speeding)?


Observations:

  • Positive INATTENTION was reported in roughly 14% of collisions (27,961 of 205,380), and positive DUI in roughly 4% (8,974).

  • Surprisingly, SPEEDING is infrequently noted in collision reports (9,275, or about 4.5%).

Note: INATTENTION and SPEEDING have only either a Y(es) or NULL value, suggesting that the lack of value (NULL) means these are not contributing factors.

However, fewer DUI are NULL, as most are marked with either (Y)es or (N)o values. It is not apparent whether the absence of a value is an oversight, if the presence of DUI was inconclusive, or if there is some other meaning. For the rest of this report, unless otherwise stated, NULL values for DUI are not included in our analysis when considering the DUI variable.
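Here is a minimal sketch of how NULL (empty) DUI values can be set aside before computing a DUI rate. The toy vector uses the same three levels as the dataset; the variable names are made up for illustration.

```r
# Toy DUI column with the dataset's three levels: "", "N", "Y".
dui <- factor(c("N", "N", "Y", "", "N", "", "Y", "N"),
              levels = c("", "N", "Y"))

# Keep only reports where DUI was explicitly marked, then drop the empty level.
dui_known <- droplevels(dui[dui != ""])

# Share of "Y" among reports with a recorded DUI value.
dui_rate <- mean(dui_known == "Y")

print(table(dui_known))
print(dui_rate)
```

Note that the rate is taken over reports with a recorded value only, so it will be somewhat higher than a rate taken over all reports.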


Univariate Analysis

What is the structure of your dataset?

The cleaned dataset contains 205,380 collisions reported by SPD between the years 2004 and 2018.

Most of the variables are categorical, describing:

  • environmental factors
  • location factors
  • points in time

Boolean-type variables (Yes, No, or NULL) include:

  • Inattention
  • DUI
  • Speeding

Quantitative variables include counts of:

  • bicycles
  • vehicles
  • people
  • injuries
  • fatalities

Other observations:

Most collisions:

  • happen in a ‘block’ location (as opposed to intersection or alley)
  • involve 2 vehicles
  • occur on a clear or partly cloudy day, during daylight on a dry road
  • are categorized as a ‘Rear End’ or ‘Sideswipe’
  • usually involve property damage only, as opposed to injury or fatality

What is/are the main feature(s) of interest in your dataset?

The main features of interest for me are:

  • Time: are there trends across the years, or during an average day?
  • DUI: are there patterns to DUI collisions?
  • Collision Type: Possible disparities between SPD’s and SDOT’s classification of collision types.
  • Severity: What kinds of collisions have higher Severity?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Counts of Things may help explain possible SPD and SDOT collision classification disparities.

Counts of Things may also help explain Severity classifications.

Also, geographical and environmental variables may help support the investigation of main features.

Did you create any new variables from existing variables in the dataset?

Yes. I created point-in-time variables to help make reporting and graphing easier:

  • Year
  • Season
  • Hour
  • DayPart (morning, afternoon, etc)
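The sketch below shows one way such point-in-time variables could be derived from INCDATE and Hour. The season mapping uses meteorological seasons, which is consistent with the month and season counts in the summary above (June through August add up to the Summer count). The DayPart cut points are hypothetical, since this report does not state the exact boundaries.

```r
# Toy inputs: INCDATE as Date, Hour as integer (NA where no time was recorded).
incdate <- as.Date(c("2013-04-02", "2006-10-15", "2004-01-20", "2015-07-04"))
hour    <- c(15L, 10L, NA, 23L)

year  <- factor(format(incdate, "%Y"))
month <- as.integer(format(incdate, "%m"))

# Meteorological seasons: Dec-Feb = Winter, Mar-May = Spring, and so on.
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
season <- factor(season_by_month[month])

# Hypothetical DayPart boundaries (an assumption for this example).
day_part <- ifelse(is.na(hour), NA,
            ifelse(hour >= 6  & hour < 12, "Morning",
            ifelse(hour >= 12 & hour < 18, "Afternoon",
            ifelse(hour >= 18 & hour < 22, "Evening", "Night"))))
day_part <- factor(day_part)

print(data.frame(incdate, year, season, hour, day_part))
```

Records without a time keep NA for both Hour and DayPart, matching the NA counts in the summary.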

I created a new categorical collision variable (SDOTtype) that groups similar SDOT collision types together. For example, I combined 6 kinds of ‘Sideswipe’ collisions into one category called ‘Sideswipe’.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed the strongly right-skewed Counts of Things distributions (vehicles, people, injuries, etc). Both the vehicle and person counts peaked at 2 while all other counts peaked at 0. Person count, injuries and serious injuries have a long right tail. The log10 transformation made it easier to see counts other than the left-side peaks.

Other changes:

  • Original values of DUI included: 0, 1, Y, N and null. In order to standardize the values, I changed 0 to N and 1 to Y.

  • I deleted rows with Incident Dates before 2004 or after 2018. Original records outside that range contained only partial-year records.

  • There was a single record with a null value for SEVERITYCODE. I first verified there were no interesting anomalies in that record (like a large Fatality count), then deleted it. Otherwise, every graph would have included an additional tick on an axis, or an additional legend, just for that one record.

  • Original WEATHER values included both ‘Unknown’ and ‘’ (empty space). I standardized by combining both as ’Unknown’.

  • Original SEVERITYDESC string values were postfixed with ‘Collision’, which made labels unnecessarily long, so I removed those postfixes.

  • I deleted columns from the dataset that were not of interest for this study (such as report numbers and administrative codes.)

  • I renamed two columns for easier programming and plot reading:

      • DUI <- UNDERINFL (“Whether or not a driver involved was under the influence of drugs or alcohol.”)
      • INATTENTION <- INATTENTIONID (“Whether or not collision was due to inattention. (Y/N)”)
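The cleaning steps listed above can be sketched in base R as follows. This is only an illustration on a made-up four-row extract, not the exact code used for this report; the `raw` and `clean` names and all row values are invented.

```r
# Toy raw extract; column names follow the report, values are made up.
raw <- data.frame(
  INCDATE       = as.Date(c("2003-12-31", "2004-06-01", "2018-12-31", "2019-01-05")),
  UNDERINFL     = c("0", "1", "N", ""),
  INATTENTIONID = c("", "Y", "", "Y"),
  WEATHER       = c("Raining", "", "Unknown", "Clear or Partly Cloudy"),
  SEVERITYDESC  = c("Injury Collision", "Fatality Collision",
                    "Property Damage Only Collision", "Unknown"),
  stringsAsFactors = FALSE
)

# Keep only the full years 2004 through 2018.
in_range <- raw$INCDATE >= as.Date("2004-01-01") &
            raw$INCDATE <= as.Date("2018-12-31")
clean <- raw[in_range, ]

# Standardize DUI flags: 0 becomes N, 1 becomes Y.
clean$UNDERINFL[clean$UNDERINFL == "0"] <- "N"
clean$UNDERINFL[clean$UNDERINFL == "1"] <- "Y"

# Treat a blank WEATHER value as 'Unknown'.
clean$WEATHER[clean$WEATHER == ""] <- "Unknown"

# Drop the trailing ' Collision' from severity descriptions.
clean$SEVERITYDESC <- sub(" Collision$", "", clean$SEVERITYDESC)

# Rename the two columns for readability.
names(clean)[names(clean) == "UNDERINFL"]     <- "DUI"
names(clean)[names(clean) == "INATTENTIONID"] <- "INATTENTION"

print(clean)
```

Filtering by date first keeps the later recoding steps from touching rows that are about to be discarded anyway.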


Bivariate Plots Section

From this correlation chart, we see moderately strong relationships between:

DUI and …

Vehicle Count and …

Let’s view these stronger correlations in a couple different ways …


Correlations table

                DUI VEHCOUNT SEVERITYCODE COLLISIONTYPE WEATHER
DUI            1.00     0.65         0.51          0.48   -0.57
VEHCOUNT       0.65     1.00         0.38          0.48   -0.49
SEVERITYCODE   0.51     0.38         1.00          0.24   -0.43
COLLISIONTYPE  0.48     0.48         0.24          1.00   -0.34
WEATHER       -0.57    -0.49        -0.43         -0.34    1.00
PERSONCOUNT    0.38     0.55         0.37          0.25   -0.32
ROADCOND       0.27     0.23         0.14          0.18    0.32
LIGHTCOND      0.58     0.60         0.39          0.45   -0.34
              PERSONCOUNT ROADCOND LIGHTCOND
DUI                  0.38     0.27      0.58
VEHCOUNT             0.55     0.23      0.60
SEVERITYCODE         0.37     0.14      0.39
COLLISIONTYPE        0.25     0.18      0.45
WEATHER             -0.32     0.32     -0.34
PERSONCOUNT          1.00     0.11      0.30
ROADCOND             0.11     1.00      0.29
LIGHTCOND            0.30     0.29      1.00


Easy-reference plot


It is clear that DUI and VEHCOUNT have the strongest relationships.

library(dplyr)
library(scales)

# Count rows for each (col1, col2) pair and compute each pair's share within col1.
create_summary <- function(data, col1, col2) {
  col1 <- enquo(col1)
  col2 <- enquo(col2)
  result_summary <- data %>%
    group_by(!!col1, !!col2) %>%
    summarise(count = n()) %>%
    mutate(perc = count / sum(count)) %>%
    mutate(label = percent(perc %>% round(5)))

  result_summary
}


COLLISIONTYPE and VEHCOUNT

COLLISIONTYPE and VEHCOUNT have a moderately-strong relationship (.5). Let’s see what they look like.

It appears that ‘Rear End’ collisions involve noticeably more vehicles in single incidents. This chart shows percentages, not counts, so we do not know from this plot just how many collisions involve these multiple vehicles.


Has the rate of DUI collisions changed over the years?

Interestingly, the lowest recorded DUI rate was in 2015, the year of our second peak in total collisions. What happened in 2015 to cause such a drop?

However, when we plot the rate of DUI in perspective with all collisions, the DUI rate changes lose their punch and appear relatively steady.
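A minimal sketch of how a per-year DUI rate can be computed and set next to the yearly totals. The toy rows are made up, and it assumes records with NULL DUI have already been excluded.

```r
# Toy data: Year and DUI ("Y"/"N"), with NULL DUI rows already removed.
toy <- data.frame(
  Year = c(2004, 2004, 2004, 2004, 2005, 2005, 2005, 2005, 2005),
  DUI  = c("Y", "N", "N", "N", "Y", "Y", "N", "N", "N")
)

# Share of DUI == "Y" within each year.
dui_rate_by_year <- tapply(toy$DUI == "Y", toy$Year, mean)

# Total collisions per year, for putting the rate in perspective.
total_by_year <- table(toy$Year)

print(round(dui_rate_by_year, 3))
print(total_by_year)
```

Plotting the rate on a 0 to 100% axis, rather than zoomed in on its own range, is what makes year-to-year changes look modest.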


Let’s look at the same plot for INATTENTION.

Although INATTENTION was not strongly correlated with other factors, we did see in the Univariate section that it is marked on a sizable share of reports (about 14%), so let’s see how the INATTENTION rate changed over the years.

There was a significant increase in INATTENTION being a factor in collisions, starting in the year 2013. Perhaps this is a result of the increase in the use of mobile devices, or of directives for the SPD to record ‘inattention’ as a factor in collisions. Further analysis would require more data and is outside the scope of this study.


SPD Reports vs SDOT Records

In the Univariate section, we saw a possible disparity between SPD’s reporting of collision factors and SDOT’s follow-up recording of these factors. Let’s look into this a little more.


View By Count

It’s interesting that collisions SPD reports as ‘Sideswipe’ sometimes end up categorized by SDOT as ‘Other Collision’ or ‘Rear End’. Also, it appears that SPD ‘Pedestrian’ labels frequently show up in SDOT records as ‘Head On’.


View By Percent

View the same information as the plot above, but show the percentages of COLLISIONTYPEs within each SDOTtype, instead of counts:
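The count-to-percent conversion can be sketched with a base R cross-tabulation. The rows below are made up; only the two column names come from the dataset.

```r
# Toy paired classifications (made-up rows).
toy <- data.frame(
  SDOTtype      = c("Rear End", "Rear End", "Rear End", "Sideswipe", "Sideswipe"),
  COLLISIONTYPE = c("Rear Ended", "Rear Ended", "Sideswipe", "Sideswipe", "Parked Car")
)

# Counts of each SPD type within each SDOT type.
counts <- table(toy$SDOTtype, toy$COLLISIONTYPE)

# margin = 1 divides each row by its row total, so every SDOTtype sums to 1.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Because each SDOTtype row sums to 1, small categories get the same visual weight as large ones, which is what makes the spread easier to see.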

Plotting by percentage makes it easier to see the spread of COLLISIONTYPEs. For example, we can now easily see that SPD ‘Sideswipe’ (yellow) are also found in SDOT ‘Hits Pedestrian’ records.

Also, we can now see that SPD reported collisions involving Cycles are spread-out over several SDOT categories.

It is possible that SDOT reclassifies Collision Types after all the dust settles and more facts are available. It would be interesting to dig a little deeper into reclassifications.


Let’s do a similar study of JUNCTIONTYPE vs ADDRTYPE.

It appears that the JUNCTIONTYPE and ADDRTYPE classifications are fairly in sync, and that JUNCTIONTYPE is used as a sub-category of ADDRTYPE.


Has the distribution of Collision types changed over the years?

Surprisingly, it appears the % of collisions with ‘Hits Pedestrian’ rose significantly around the years where there were fewer total collisions (2010-2011). Are there more pedestrians on the road, and fewer drivers? Maybe more inattentive pedestrians getting hit? That would be an interesting further study, but outside the scope of this project since INATTENTION does not specify who was inattentive.

It will be interesting to look more into the ‘Hits Pedestrian’ increase.


Adjusted Hits Pedestrian

There are several variables that indicate a Pedestrian was involved in a collision (COLLISIONTYPE, SDOTtype, PEDCOUNT). Are all pedestrian-related collisions being marked by SDOT as ‘Hits Pedestrian’?

Let’s bundle all records with any Pedestrian indicator into ‘Hits Pedestrian’. Will the increasing trend of ‘Hits Pedestrian’ records stay the same, or even out across the years?
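One way to sketch this bundling in base R is shown below. The toy rows are made up, and `SDOTtype_adj` is a hypothetical column name used only for this illustration.

```r
# Toy records (made-up values).
toy <- data.frame(
  SDOTtype      = c("Head On", "Hits Pedestrian", "Rear End", "Sideswipe"),
  COLLISIONTYPE = c("Pedestrian", "Pedestrian", "Rear Ended", "Sideswipe"),
  PEDCOUNT      = c(1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# A record counts as pedestrian-related if any of the three indicators says so.
is_ped <- toy$SDOTtype == "Hits Pedestrian" |
          toy$COLLISIONTYPE == "Pedestrian" |
          toy$PEDCOUNT > 0

# Adjusted category: non-pedestrian rows keep their original SDOT category.
toy$SDOTtype_adj <- ifelse(is_ped, "Hits Pedestrian", toy$SDOTtype)

print(toy)
```

Keeping the adjusted values in a separate column leaves the original SDOTtype available for the before-and-after comparison.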

Here we see approximately 25% of the records with some indication of Pedestrian involvement were originally classified as ‘Head On’. This suggests that many pedestrians are hit with the front of a car, as opposed to sideswiped or rear-ended.

If we re-classify all the above as ‘Hits Pedestrian’, will the trend change?

Comparing the original and the Adjusted plots, it appears that the ratios pretty much stay the same, except some of the ‘Head On’ records got moved to ‘Hits Pedestrian’.

Theory: Whatever causes the SPD and SDOT discrepancies in ‘Pedestrian’ reporting appears to be consistent over the years, so the increased rate of ‘Hits Pedestrian’ looks real rather than an artifact of classification (a formal hypothesis test would be needed to confirm that the rate changed).


DUI relationships


SEVERITY

The proportions appear similar between non-DUI and DUI, in that the more serious the Severity, the lower the collision count, whether DUI is true or not.

What if we looked at the proportions instead of counts? Will the proportions still appear the same, as suggested in the plot above? While we’re at it, let’s compare proportions for the INATTENTION variable.

The proportions of Severity for DUI are similar but not the same. Here we see there are more serious severities in DUI collisions.

Interestingly, the rates of Severity between DUI and INATTENTION indicators are very similar, although Severity tends to be worse in DUI collisions.

Later, let’s see if other factors come into play with the seriousness of DUI. For example, perhaps more DUI collisions occur at night, when people tend to imbibe.

 

Out of curiosity, how often are DUI and INATTENTION both marked as factors on the same Collision report?

For most reports with any DUI value (‘Y’ or ‘N’), INATTENTION is not marked. For all reports with no DUI value, INATTENTION also has no value.

This plot was just for curiosity; I will not pursue INATTENTION analysis any further.
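A question like this comes down to a two-way count of the two flags. Here is a minimal sketch on made-up values, using the same factor levels as the dataset.

```r
# Toy flags with the dataset's levels ("" means no value was recorded).
toy <- data.frame(
  DUI         = factor(c("N", "N", "Y", "", "N", "Y"), levels = c("", "N", "Y")),
  INATTENTION = factor(c("", "Y", "", "", "", "Y"), levels = c("", "Y"))
)

# Two-way count of DUI value against INATTENTION value.
xtab <- table(DUI = toy$DUI, INATTENTION = toy$INATTENTION)
print(xtab)

# Reports where both flags are marked "Y".
both <- xtab["Y", "Y"]
print(both)
```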


Other DUI relationships

Light Conditions has the most significant differences between non-DUI and DUI. There are significantly more collisions in the ‘Dark - Street Lights On’ condition when DUI is marked. Whether this is simply a factor of time (night time being when the street lights are on and, perhaps, when more people are DUI), and has less to do with the fact that the street lights are on, is a question requiring further study.

Although Weather had a fairly strong correlation with DUI (-0.57), there does not appear to be a significant difference in Weather between DUI and non-DUI collisions. It is worth noting, though, that very few DUI collision reports leave WEATHER as ‘Unknown’. This may be another indication that SPD reports are more completely filled out when DUI is suspected.

Collision Type of ‘Other’ is more than double when DUI is True, but Sideswipes are significantly less, as are ‘Hits Pedestrian’.


Let’s look a little closer at the SDOTtype and LIGHTCOND plots in larger scale.

DUI and SDOTtype

It would be interesting to break-down ‘Other’ Collision Types for more details.

DUI and LIGHTCOND

Is it possible that glare from Street Lights could be a factor in DUI collisions?
(digging into this more is way outside the scope of this project)


How are counts by LIGHTCOND spread out across the day?

Higher counts by LIGHTCOND at certain parts of the day hold no surprises: collisions in afternoons will naturally tend to be during Daylight.

It is interesting that most collisions occur during the afternoon. Is road volume a contributing factor?


LIGHTCOND and TIME line graph.

Let’s look at the same data, in a line graph format. It may be easier to see what’s going on. Let’s also look at percentages instead of counts.

Again, no surprises here … LIGHTCOND and Hour are paired.


Is there a change in hourly-collision patterns between the 2-highest and 1-lowest years?

2015, our second peak year, has a more distinct peak during the lunch hours, as well as an overall increase during the day. Interestingly, this same year saw a smaller midnight collision count than the other two years.

2005, our lowest year, had a small drop in the morning commute hours that the other two years did not. Plus, there is a plateau of collision counts around the lunch hour that the other years did not have, and the spike in the evening commute is less pronounced.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Collision Types and Vehicle Counts

  • There are a few patterns to Collision types and the number of vehicles that tend to be involved in each type. Apparently, ‘Sideswipe’ incidents involve anywhere between 2 and 9 vehicles. Most ‘Pedestrian’ incidents involve just one car, but there is at least one ‘Pedestrian’ incident that involved 7 cars.

DUI and other factors

  • The total number of DUI reports has changed over the years, but the rate of change is not dramatic.

  • In the span of the average day, there is a definite trend in that DUI’s generally occur in the evening and night hours.

  • DUI’s are involved in fewer ‘Sideswipe’ and ‘Pedestrian’ collision types, and more ‘Other’, suggesting more digging into ‘Other’ is needed.

  • DUI incidents tend to have higher severities.

SPD vs SDOT Classifications

  • There is a difference between SPD’s and SDOT’s categorization of collisions, which is interesting, but the difference appears to be consistent. It was SPD’s COLLISIONTYPE that proved to have stronger correlations with other factors, but for the sake of simplicity in my analysis, I chose to continue using SDOT’s categories in my plots.

  • There appeared to be anomalies when it came to SPD’s reporting of Pedestrian related collisions, and SDOT’s recordings of them. It turns out that whatever is happening between reporting and recording is consistent. It appears that SDOT categorizes some Pedestrian hits as ‘Head On’ as opposed to ‘Hits Pedestrian’.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • INATTENTION increased significantly starting around 2013.

  • Most collisions happen in LIGHTCOND = ‘Daylight’, and during the afternoons.

  • I’m surprised that LIGHTCOND did not have a higher score with Hour in the correlation tables, and had a weak relationship with DayPart.

  • It appears that JUNCTIONTYPE is a subset of ADDRTYPE.

  • Weather (surprisingly, the cor value is high, but there is little difference between DUI and non-DUI weather, other than fewer DUI records are marked with WEATHER=‘Unknown’)

  • Light Condition (not surprisingly, DUI collisions are mostly in Street Light conditions)

What was the strongest relationship you found?

In general, DUI and VEHCOUNT each have the strongest relationships with other features.

In the correlations table the largest coefficient is between DUI and VEHCOUNT (0.65), but the relationship that stood out most in the plots was between DUI and LIGHTCOND (0.58). The LIGHTCOND of ‘Street Lights On’ more than doubles when DUI is true. However, this may very well be a ‘correlation is not causation’ example, in that DUI people may more frequently be driving at night when street lights just happen to be on.


Multivariate Plots Section

In this section, I dig deeper into exploring relationships between:


How do changes in the DUI collision rate compare to total collisions over the years?

Although DUI rates change very little over the years, they do seem to drop when the total number of collisions rises.


When DUI’s happen, compared to when Injuries happen.

This plot suggests that, although the total number of collisions between midnight and 4am are at their lowest, most of these collisions can be attributed to DUIs.

Also, the number of serious injuries increases during the evening commute hours (4-6pm), and the number of fatalities is highest around 6pm and around 9pm.


What kind of collisions involve fatalities, and on average how many annually?

In collisions that involved fatalities, on average most were ‘Hits Pedestrian’ and ‘Other Collision’. There were a few outliers in each of ‘Sideswipe’ and ‘Other Collision’, with more than 2x the annual average fatalities in certain years. Not surprisingly, there were the fewest fatalities in ‘Struck Object’.
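An "average annual fatalities" figure of this kind is a two-step aggregation: sum per year and type, then average those yearly sums per type. Here is a minimal sketch on made-up rows; the `annual` and `avg_annual` names are invented for the example.

```r
# Toy records (made-up values).
toy <- data.frame(
  Year       = c(2004, 2004, 2004, 2005, 2005, 2005),
  SDOTtype   = c("Hits Pedestrian", "Hits Pedestrian", "Sideswipe",
                 "Hits Pedestrian", "Sideswipe", "Sideswipe"),
  FATALITIES = c(1, 2, 0, 1, 1, 1)
)

# Step 1: total fatalities per year within each collision type.
annual <- aggregate(FATALITIES ~ Year + SDOTtype, data = toy, FUN = sum)

# Step 2: average those yearly totals for each collision type.
avg_annual <- aggregate(FATALITIES ~ SDOTtype, data = annual, FUN = mean)

print(annual)
print(avg_annual)
```

Averaging the yearly totals, rather than the individual records, is what makes a single unusual year show up as an outlier against its type's average.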

More analysis would be required to figure out what ‘Other Collision’ involves. Looking at SPD’s COLLISIONTYPE helps, but they even have a significant ‘Other’ category with a significant number of ‘Other’ fatalities. But we can see here that some annual fatalities involve ‘Cycles’ and ‘Head On’:


SPD vs SDOT with VEHCOUNTS

In the Bivariate section, we explored the SPD’s Collision relationship with VEHCOUNT. Now let’s also compare VEHCOUNT with SDOT’s Collision categories, to see what the differences are, if any.

Whereas the SPD tends to leave the VEHCOUNT=0 reports uncategorized (COLLISIONTYPE=‘’), SDOT follows up on these incidents by assigning its own categorization. Also, in the VEHCOUNT=12 column (there may have been only 1 VEHCOUNT=12 record), the SPD marked it as ‘Parked Car’ while SDOT followed up and categorized the incident(s) as ‘Hits Pedestrian’. There are a number of other inconsistencies that would require further analysis outside the scope of this project.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The DUI / Light Condition / Severity relationship is strong: DUI collisions tend to happen in the ‘Dark - Street Lights On’ condition, and they also tend to have higher severities.

The DUI / Time of Day / Severity relationship is also strong, since LIGHTCOND and time of day are closely paired.

It appears that if there is an incident between approximately 12am and 4am, it will most likely involve a DUI.
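One way to check a claim like this is to compute the share of DUI collisions for each hour of the day. The following is a rough sketch on made-up data. `HOUR` is a hypothetical column (0-23) that would have to be derived from the incident timestamp, and `DUI` uses the dataset's "Y"/"N" coding.

```r
# Toy data; HOUR is a hypothetical column derived from the incident time.
collisions <- data.frame(
  HOUR = c(1, 1, 1, 2, 8, 8, 17, 17, 17, 17),
  DUI  = c("Y", "Y", "N", "Y", "N", "N", "N", "Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Share of collisions flagged as DUI within each hour.
dui_share <- tapply(collisions$DUI == "Y", collisions$HOUR, mean)

# Share of DUI collisions across the whole midnight-to-4am window (hours 0-3).
late_night  <- collisions$HOUR >= 0 & collisions$HOUR < 4
night_share <- mean(collisions$DUI[late_night] == "Y")

print(round(dui_share, 2))
print(night_share)
```

A window share above 0.5 would support "most likely involves a DUI". Records with a blank `DUI` value would need to be dropped or handled separately before computing the shares.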

Looking at averages across the years, reports suggest that most fatalities occur with ‘Hits Pedestrian’ and ‘Other’.

Were there any interesting or surprising interactions between features?

The disparity between SPD and SDOT collision categorization is made clearer when looking at Fatality averages. It is unclear why SDOT categorizes some collisions as ‘Other’ when SPD categorizes them as ‘Head On’, and some of those may involve pedestrians.

I am surprised to see that (generally) when annual DUI rates are up, total collision incident counts are down.
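The inverse relationship could be quantified by building a per-year summary and correlating the DUI rate with the total count. This is only a sketch on made-up data. `YEAR` is assumed to be derived from `INCDATE`, and with roughly 15 annual data points any correlation would be weak evidence at best.

```r
# Toy incident-level data: 4, 6, and 5 collisions in three years.
collisions <- data.frame(
  YEAR = rep(c(2004, 2005, 2006), times = c(4, 6, 5)),
  DUI  = c("Y", "Y", "N", "N",
           "Y", "N", "N", "N", "N", "N",
           "Y", "Y", "N", "N", "N"),
  stringsAsFactors = FALSE
)

# Per-year summary: total collisions and the fraction flagged as DUI.
# table() and tapply() both order their results by ascending year.
yearly <- data.frame(
  YEAR     = sort(unique(collisions$YEAR)),
  TOTAL    = as.vector(table(collisions$YEAR)),
  DUI_RATE = as.vector(tapply(collisions$DUI == "Y", collisions$YEAR, mean))
)

# Pearson correlation between yearly totals and yearly DUI rates.
r <- cor(yearly$TOTAL, yearly$DUI_RATE)

print(yearly)
print(r)
```

A negative `r` matches the pattern described, though it says nothing about why the two move in opposite directions.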


Final Plots and Summary

Description One

I wanted to introduce the Seattle Collisions dataset with a simple visual illustrating collision incident trends over time (across the years, and through the course of an average day).

The By Year plot shows the presence of a repeated rise-and-fall trend, highlighting 2 peak years (2005 and 2015) and 1 valley year (2010).

The By Time of Day plot presents a simple trend of collision frequencies throughout any given day, with incidents increasing from morning until afternoon commute hours.

Design justifications

I chose to use a 2-column grid, as opposed to a 1-column/2-row grid, so the y-axis would be taller. I did this because the range of incident counts (y-axis) is fairly large (up to 15,000), and I wanted the plot to represent the large change in counts from year to year. If I had used a 1-column grid, the differences between years would have appeared subtle. With that more subtle chart, I would have felt the need to print rate changes on each bin to show the differences, which proved to be too distracting.

I debated whether to include a point-line in the By Time of Day plot, to show where points actually sat in relation to the smoother slope. I decided to keep the plot simple, as its purpose was simply to show the general trend of collision incidents throughout the day.

I chose a grey color scheme to keep the charts simple. I did not want colors to distract from the main point (the trends). However, I did vary the shade of grey on the two highest and one lowest incident-count years in the By Year plot to draw the audience’s attention to those extremes.


Plot Two - Categorical Differences

Description Two

I chose this plot because it shows the disparity between how the SPD (Seattle Police Department) and SDOT (Seattle Department of Transportation) choose to categorize each incident. For example, the left two bins show that when SDOT categorizes an incident as ‘Heads On’, the SPD had already categorized it as ‘Pedestrian’. The SPD and SDOT had the same classification for very few of the ‘Heads On’ and ‘Pedestrian’ labels. Less than 50% of the time they agreed on ‘Rear Ended’ classifications.

It is possible that both systems of categorization are coordinated intentionally, and that the absence of matching values is not a sign of disagreement but that by using both we get a better picture of each incident. It may also suggest the dataset could be enhanced if each category were a boolean factor, as opposed to two single-value factors.
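The level of agreement between the two agencies can be measured directly once both labels are mapped to a shared vocabulary. The sketch below assumes that mapping has already been done. `SPD_LABEL` and `SDOT_LABEL` are hypothetical column names, and the rows are made up.

```r
# Toy data with both agencies' labels already mapped to shared names.
# SPD_LABEL and SDOT_LABEL are hypothetical column names.
collisions <- data.frame(
  SPD_LABEL  = c("Rear Ended", "Rear Ended", "Rear Ended",
                 "Pedestrian", "Pedestrian", "Head On"),
  SDOT_LABEL = c("Rear Ended", "Other", "Angles",
                 "Pedestrian", "Head On", "Other"),
  stringsAsFactors = FALSE
)

# Full cross-tabulation: rows are SPD labels, columns are SDOT labels.
confusion <- table(SPD = collisions$SPD_LABEL, SDOT = collisions$SDOT_LABEL)

# Fraction of incidents, per SPD label, where SDOT chose the same label.
agree <- collisions$SPD_LABEL == collisions$SDOT_LABEL
agreement_by_spd <- tapply(agree, collisions$SPD_LABEL, mean)

print(confusion)
print(round(agreement_by_spd, 2))
```

An agreement rate below 0.5 for a label corresponds to the "less than 50% of the time" observation. The off-diagonal cells of the cross-tabulation show which alternative labels SDOT used instead.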

Design justifications

I chose bright colors (from a colorblind-friendly color scheme) to emphasize the SPD vs SDOT disparities. I hope the audience’s first impression is that there is a complication here, and that the plot is interesting enough to spend a few moments comparing the SPD vs SDOT incident classifications.


Plot Three - DUI Severities

Description Three

I chose this plot to illustrate the increased rate of injury and death in DUI incidents versus non-DUI incidents. DUI incidents have a 10 times greater chance of resulting in Fatalities, and more than 3 times greater chance of Serious Injuries.
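A "times greater chance" figure of this kind is a ratio of two rates (a relative risk). The sketch below shows the arithmetic on made-up counts chosen to give a ratio of 10. It is not a reproduction of the actual figures. Rows with a blank `DUI` value are dropped first, which is an assumption about how unknowns should be treated.

```r
# Toy data: 10 DUI and 100 non-DUI collisions, one fatal collision in each
# group, plus one record with an unknown (blank) DUI flag.
collisions <- data.frame(
  DUI        = c(rep("Y", 10), rep("N", 100), ""),
  FATALITIES = c(1, rep(0, 9), 1, rep(0, 99), 0),
  stringsAsFactors = FALSE
)

# Keep only records where the DUI flag is known.
known <- collisions[collisions$DUI %in% c("Y", "N"), ]

# Rate of collisions with at least one fatality, per DUI group.
fatal_rate <- tapply(known$FATALITIES > 0, known$DUI, mean)

# Relative risk: DUI fatality rate divided by non-DUI fatality rate.
relative_risk <- fatal_rate[["Y"]] / fatal_rate[["N"]]

print(fatal_rate)
print(relative_risk)
```

The same calculation with `SERIOUSINJURIES > 0` in place of `FATALITIES > 0` would give the corresponding ratio for serious injuries.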

Design justifications

I debated whether or not to use points in this plot, since there are far fewer DUI incidents than non-DUI incidents, resulting in fewer points on the plot (even though the top three rates are higher). I decided to include the points because I wanted to demonstrate that, even though there are fewer DUI incidents, their rates of fatalities and injuries are much higher.

I also removed the color legend because it would have been redundant with the y-axis labels.


Reflection

The dataset

I wanted to challenge myself with this project by finding my own dataset to use.

I initially thought the Seattle Collisions dataset was clean. While exploring the data, I progressively realized that was not quite true. I discovered the hard way that cleaning does take significant time (as warned in the project instructions!). This includes not just the time for writing code, but the time it takes to dig deeper into the data in order to make ethical decisions on how best to clean while avoiding the unintentional misrepresentation of data.

Also, this dataset has very few quantitative factors, making it challenging to create a variety of graph types. That being said, I learned more about plotting in R than I might have otherwise. I learned quite a bit by experimenting with color, scale, labels, and which plot types to use with which data type.

I enjoyed the ‘running commentary, stream of thought’ nature of this data exploration project. By not strategizing too much ahead of time exactly what I wanted to conclude at the end, I made a few surprising discoveries.

The data

I was surprised that annual collision counts did not steadily rise over the years, but instead show a distinct rise/fall pattern. Additional analysis in future work should introduce Seattle population estimates, to see how they correlate with collision data and with patterns of incident factors over time.

I was not surprised there is an increase in the rate of ‘Inattention’ collisions, given the rise in the use of mobile devices. This may also be an interesting study for future work: what would ‘Inattention’ patterns reveal with more study of how (and how often) mobile devices are used by car occupants as well as pedestrians and cyclists?

Looking into the SPD vs SDOT categorizations made it more real to me how important it is to know how data is collected and stored before it even gets to the analyst. Otherwise, it’s difficult to make any assertions about the data and what it represents. For example, are the SPD and SDOT categorizations of each incident a coordinated effort to describe an incident, or is data collection inconsistent or incomplete?